General Overview

  • We use the snakemake workflow management system[1,2] for:
    • Maintaining reproducibility in technical validation and regeneration of results.
    • Creating scalable data analyses that can run on server, grid, or cloud environments.
    • Fostering sustainable improvement of the microbiome data analysis.
  • We also review existing workflows[2,3] to gain better insights for improving microbiome data analysis.
  • We break complex workflows into small, contiguous, related chunks, where each major step forms a separate executable snakemake rule.


We aim to keep fostering continuous integration and development of highly reproducible workflows.



Current project tree

.
├── LICENSE.md
├── README.md
├── config
│   ├── config.yaml
│   ├── pbs-torque
│   ├── samples.tsv
│   ├── slurm
│   └── units.tsv
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   ├── metadata
│   ├── mock_control
│   ├── mothur
│   ├── reads
│   ├── references
│   ├── resources
│   └── test
├── images
│   ├── bioinformatics.png
│   ├── bkgd.png
│   ├── bkgd0.png
│   ├── cicd.png
│   └── smkreport
├── imap-bioinformatics.Rproj
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── export.bib
│   ├── imap.bib
│   └── references.bib
├── report.html
├── resources
│   ├── final.biom
│   ├── final.lefse
│   ├── final.shared
│   ├── final.taxonomy
│   ├── metadata.csv
│   ├── sample.final.shared
│   └── samples.tsv
├── results
│   └── project_tree.txt
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    ├── rules
    ├── schemas
    └── scripts

22 directories, 28 files



Current snakemake workflow

  • A snakemake workflow is typically defined by specifying rules.
  • The rule graph shows how the rules are connected through their input and output files.
  • Snakemake automatically determines the dependencies between the rules and creates a DAG (Directed Acyclic Graph) that can be rendered in dot format.
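
To illustrate how dependencies can be inferred from input and output files, here is a minimal Python sketch. It is not snakemake's implementation, and the rule names and file paths are hypothetical rather than taken from this project's Snakefile. It matches each rule's inputs to the rule that produces them, orders the rules with the standard-library graphlib module (Python 3.9 or later), and prints the graph as dot text.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: name -> (input files, output files). These names and
# paths are illustrative and are not taken from the project's Snakefile.
RULES = {
    "make_contigs": (["reads/sample_R1.fastq", "reads/sample_R2.fastq"], ["contigs.fasta"]),
    "align": (["contigs.fasta", "references/silva.align"], ["aligned.fasta"]),
    "cluster": (["aligned.fasta"], ["final.shared"]),
    "classify": (["aligned.fasta", "final.shared"], ["final.taxonomy"]),
}


def build_dependencies(rules):
    """Map each rule to the set of rules that produce one of its inputs."""
    producer = {out: name for name, (_, outputs) in rules.items() for out in outputs}
    return {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _) in rules.items()
    }


def to_dot(deps):
    """Render the dependency map as Graphviz dot text (producer -> consumer)."""
    lines = ["digraph rulegraph {"]
    for rule in sorted(deps):
        for upstream in sorted(deps[rule]):
            lines.append(f'    "{upstream}" -> "{rule}";')
    lines.append("}")
    return "\n".join(lines)


deps = build_dependencies(RULES)
# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(deps).static_order())
print(order)
print(to_dot(deps))
```

Files that no rule produces, such as the raw reads and the reference alignment here, are treated as existing inputs and create no edge in the graph.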




Screenshot of interactive snakemake report

The snakemake HTML report can be viewed in any compatible browser, such as Chrome, to explore the workflow and its associated statistics. You can close the left sidebar to get a wider view of the display.




Microbial Profiling Using Mothur

Mothur reference databases

  1. Mothur-based SILVA reference files[4].
  2. Mothur-based RDP reference files[5]. Note: The RDP database is used to classify 16S rRNA gene sequences to the genus level.
  3. ZymoBIOMICS Microbial Community DNA Standard (Cat # D6306)[6]. This standard is designed to assess bias, errors, and other artifacts that arise after the nucleic acid purification step.

Four methods can be used to profile the microbial communities present in a sample. Here we briefly describe each method:

1. Classify OTUs

  • OTUs (Operational Taxonomic Units) are clusters of similar sequences and are commonly accepted as analytical units in microbial profiling when using 16S rRNA gene markers.

2. Classify Phylotypes

  • A phylotype in microbiome research is a DNA sequence or group of sequences sharing more than an arbitrarily chosen level of similarity of a 16S rRNA gene marker.

3. Classify ASVs

  • An ASV (Amplicon Sequence Variant) in microbiome research is any single inferred DNA sequence recovered from a bioinformatics analysis of 16S rRNA marker genes.
  • In practice, an ASV is typically a cluster of sequences that are one or two bases apart from each other.

4. Classify Phylogenies

  • Microbial phylogenies are inferred from gene sequence homologies. Models of mutation determine the most likely evolutionary histories.
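
The OTU and ASV descriptions above both rest on grouping sequences that differ by only a few bases. The following Python sketch shows one simple way to do that: a greedy scheme in which rarer sequences join a more abundant representative when they are within a chosen number of mismatches. It is an illustration only and not mothur's algorithm; the function names, the toy sequences, and the read counts are all made up, and the sequences are assumed to be already aligned to the same length.

```python
def mismatches(a, b):
    """Count differing positions between two equal-length (aligned) sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(counts, max_diff):
    """Group sequences around more abundant representatives.

    counts maps sequence -> read count. Sequences are visited from most to
    least abundant; each joins the first representative within max_diff
    mismatches, otherwise it becomes a new representative.
    """
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sorted(counts, key=lambda s: (-counts[s], s)):
        for rep in clusters:
            if mismatches(seq, rep) <= max_diff:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Toy, made-up aligned sequences with read counts.
toy = {
    "ACGTACGTAC": 100,
    "ACGTACGTAT": 3,   # 1 mismatch from the first sequence
    "ACGTACGAAT": 2,   # 2 mismatches from the first sequence
    "TTGTCCGTAA": 40,  # 4 mismatches from the first sequence
}

for rep, members in greedy_cluster(toy, max_diff=2).items():
    print(rep, sum(toy[m] for m in members), members)
```

With max_diff=2 the three near-identical sequences collapse into one group and the distant sequence stays separate; with max_diff=0 every distinct sequence remains its own group.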



Preliminary analysis using Mothur

The preliminary analysis (alpha_beta_diversity rule) is part of the bioinformatics analysis. It includes:

  • Counting the reads in each group.
  • Subsampling for downstream analysis.
  • Rarefaction.
  • Computing Alpha diversity metrics.
  • Computing Beta diversity metrics.
  • Getting sample distances.
  • Constructing sample phylip tree.
  • Generating ordination matrices including PCoA and NMDS.
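
Several of the steps listed above reduce to short calculations on a table of OTU counts per sample. The sketch below uses textbook formulas for subsampling, two alpha diversity indices, and one beta diversity measure. It is not the alpha_beta_diversity rule itself, the sample counts are invented, and mothur's own calculators may use different estimators, so results are not guaranteed to match mothur output.

```python
import math
import random


def subsample(counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    drawn = random.Random(seed).sample(pool, depth)
    return {otu: drawn.count(otu) for otu in counts}


def shannon(counts):
    """Shannon diversity index: -sum(p * ln p) over observed OTUs."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p^2)."""
    total = sum(counts.values())
    return 1.0 / sum((n / total) ** 2 for n in counts.values())


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between the OTU counts of two samples."""
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    return 1.0 - 2.0 * shared / (sum(a.values()) + sum(b.values()))


# Invented OTU counts for two samples.
sample_a = {"Otu1": 50, "Otu2": 30, "Otu3": 20}
sample_b = {"Otu1": 10, "Otu2": 10, "Otu3": 80}

rarefied_a = subsample(sample_a, depth=40)
print("subsampled:", rarefied_a)
print("shannon:", shannon(sample_a))
print("inverse simpson:", inverse_simpson(sample_a))
print("bray-curtis:", bray_curtis(sample_a, sample_b))
```

Subsampling every sample to the same depth before computing the indices keeps samples with different sequencing depths comparable.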


Citation

Please consider citing the iMAP article[7] if you find any part of the iMAP practical user guides helpful in your microbiome data analysis.




Appendix

Troubleshooting (in progress)

  1. Are chimeras removed by default in newer versions of mothur?
    • Yes. Chimeras are removed by default. You can still run the remove.seqs command without error, but it is not necessary. Removal of chimeric sequences is explained here.
  2. Mothur dist.seqs is taking too long. Possible causes:
    • Merged reads are too long, probably over 300 bp.
    • Reads do not overlap when the paired reads are merged.
    • Too many unique representative sequences, probably caused by the lack of overlap.
    • Not enough computing power, which suggests using an HPC system or a cluster.
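
Before moving to a larger machine, it can help to check whether the first three causes apply. The sketch below is a hypothetical helper, not part of the workflow: it counts total sequences, unique sequences, and sequences longer than a cutoff in FASTA text. The 300-base cutoff comes from the list above; the function names and the toy reads are assumptions. Mothur's summary.seqs command reports comparable length statistics.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records, max_len=300):
    """Report total, unique, and over-length sequence counts."""
    seqs = [seq for _, seq in records]
    return {
        "total": len(seqs),
        "unique": len(set(seqs)),
        "longer_than_max": sum(len(s) > max_len for s in seqs),
        "max_length": max((len(s) for s in seqs), default=0),
    }


# Toy in-memory FASTA; for a real file, pass an open file handle instead.
toy_fasta = [
    ">read1", "ACGT" * 60,   # 240 bases
    ">read2", "ACGT" * 60,   # duplicate of read1
    ">read3", "ACGT" * 100,  # 400 bases, over the 300-base cutoff
]
summary = summarize(read_fasta(toy_fasta))
print(summary)
```

A large share of over-length reads, or a unique count close to the total, points to poorly overlapping read pairs rather than to a lack of computing power.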




References

[1]
Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., … Nahnsen, S. (2021). Sustainable data analysis with snakemake. F1000Research, 10. https://doi.org/10.12688/f1000research.29032.2
[2]
Snakemake. (2023). Snakemake. Retrieved from https://snakemake.readthedocs.io/en/stable
[3]
Close, W. L. (2020). Mothur 16S v4 analysis pipeline. Retrieved from https://github.com/wclose/mothurPipeline
[4]
Mothur-based SILVA reference files. Retrieved from https://mothur.org/wiki/silva_reference_files/
[5]
Mothur-based RDP reference files. Retrieved from https://mothur.org/wiki/rdp_reference_files/
[6]
ZymoBIOMICS microbial community DNA standard (cat # D6306). Retrieved from https://www.zymoresearch.com/zymobiomics-community-standard
[7]
Buza, T. M., Tonui, T., Stomeo, F., Tiambo, C., Katani, R., Schilling, M., … Kapur, V. (2019). iMAP: An integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics, 20. https://doi.org/10.1186/S12859-019-2965-4